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Abstract —We are motivated by problems that arise in a 
number of applications such as Online Marketing and Explosives 
detection, where the observations are usually modeled using Pois¬ 
son statistics. We model each observation as a Poisson random 
variable whose mean is a sparse linear superposition of known 
patterns. Unlike many conventional problems observations here 
are not identically distributed since they are associated with 
different sensing modalities. We analyze the performance of a 
Maximum Likelihood (ML) decoder, which for our Poisson set¬ 
ting involves a non-linear optimization but yet is computationally 
tractable. We derive fundamental sample complexity bounds for 
sparse recovery when the measurements are contaminated with 
Poisson noise. In contrast to the least-squares linear regression 
setting with Gaussian noise, we observe that in addition to 
sparsity, the scale of the parameters also fundamentally impacts 
£2 error in the Poisson setting. We show tightness of our upper 
bounds both theoretically and experimentally. In particular, we 
derive a minimax matching lower bound on the mean-squared 
error and show that our constrained ML decoder is minimax 
optimal for this regime. 

Index Terms —Poisson Model Selection, Sparse Recovery, Reg¬ 
ularized Maximum Likelihood, Optimal Minimax Estimator 

I. Introduction 

I N this paper, we study sparse signal recovery problem 
with Poisson observations. This problem is motivated by 
many practical applications where the observations are the 
counts of an event. The mean count in these applications 
depends linearly on a sparse subset of parameters. Our goal 
is to extract the sparse subset from a potentially large number 
of parameters. Some of the practical applications motivating 
our problem include explosive identification based on photon 
counts in fluoroscopy tu, and eMarketing based on website 
traffic (2). 

Motivated by these applications we propose a high¬ 
dimensional sparse signal estimation problem governed by 
Poisson distributed observations. We model the rate of the 
Poisson process as a positive mixture of known signatures. The 
rates for different observations are obtained from heteroge¬ 
neous sensors or different measurement settings and therefore 
the observations are not identically distributed. Specifically, 
we model observations y L as follows. 

V* S {1,..., n} : yi ~ Poisson(A 0j i + &J w*) 

where Ao,, is the rate of the background Poisson noise and 
assumed to be known. Each a.; = [a i; i,..., tii jP ] T is a distinct 
vector corresponding to the ?-th sensor. The collection of these 
vectors form the sensing matrix, A = [ai, ..., a,,] 1 . Our goal 
is to recover the sparse vector, w*, from {y \...., y n }. The 


parameters w* £ M5 are the mixing proportions for different 
sensors. 

Our high-dimensional setting entails p^> n and the ground 
truth mixture parameter w* is assumed to be sparse with 
4 norm ||w*|| 0 = k. We derive upper and lower bounds 
for sparse signal recovery for this Poisson setting. Our up¬ 
per bounds are based on analyzing the performance of an 
l\ constrained maximum-likelihood estimator. The optimal 
ML decoder is obtained by solving a convex optimization 
problem involving a non-linear objective function. We then 
analyze performance of minimax estimators, namely, optimal 
estimators for worst-case mean-squared error to obtain lower 
bounds. These lower bounds are derived by means of Fano 
bounds and are obtained by embedding the estimation problem 
into a detection setting by utilizing Gilbert-Varshamov bound 
0. Specifically, for a given sparsity level k := ||w*|| 0 , 
and a parameter amplitude s := ||w*||i we show that the 
upper bound for the estimation error scales as \Jsk/n. We 
then obtain a matching lower bound thus demonstrating the 
minimax optimality of our l\ constrained ML estimator. 
Dependence on Scale: The dependence of mean-squared error 
on scale parameter s does not arise in conventional sparse lin¬ 
ear least squares regression setting. Indeed sample complexity 
is primarily determined by sparsity for many random matrix 
designs and sample complexity improves with scale of the 
ground truth parameter w* for a fixed level of noise 0 . In 
contrast, in our problem setting, sample complexity degrades 
with the scale of the parameter vector. Intuitively this results 
from the fact that the curvature of the log likelihood function 
decreases with scale s of the parameter vector. Indeed, for 
large values of s, partial changes <9w = w — w* in the 
parameter vector translate to significantly smaller changes 
of the likelihood function resulting in lack of identifiability. 
Consequently, unlike the conventional case we inevitably have 
to suffer the effects of scale for the Poisson case. 

Another aspect of the problem is the high dimensional 
sparse setting, which leads to fundamental issues such as the 
Hessian being singular 0, 0- To overcome these issues 
we follow along the lines of sparse linear regression and 
consider optimizing the Poisson likelihood function under t\ 
constraints. These constraints have the effect of constraining 
the error patterns to a cone of feasible directions (descent 
cone): 

w - w* £ C := {u : ||u S c||i < ||u s ||i, |S| < k} (1) 

As a result it turns out that we need to ensure that the loss 
function is “well-behaved” only on the feasible cone. A recent 
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important development 0 in this context is to impose strong 
convexity of the loss function on the feasible cone. Given 
suitable assumptions on the design matrix A, this requirement 
is generally satisfied for many loss functions including least- 
squares losses. 

Unfortunately as it turns out, our Poisson case does not 
lead to strongly convex losses. Strong-convexity of the loss 
function amounts to the assumption that we see a non-trivial 
change in the loss function as a result of underlying parameter 
variation regardless of the ground truth w*. This requirement 
can be viewed as a need for the curvature of the loss function 
to be non-vanishing on the cone. In the Poisson case, the 
perturbation in the loss function behaves linearly in large s 
regimes (i.e. the curvature vanishes in the limit) and so the loss 
function is no longer strongly convex on the cone. This issue 
motivates us to incorporate signal strength s in the definition 
of plausible error patterns C. We then characterize the change 
in loss function at various amplitudes and sparsity levels in 
terms of the changes in the parameter vector (||dw|| = e). We 
show that the loss function is strongly convex on the newly 
defined set of plausible error directions, as long as the elements 
of A are bounded and A satisfies the so-called Restricted 
Eigenvalue (RE) condition 0. 

A wide variety of sensing matrices ranging from determinis¬ 
tic to random designs can be shown to satisfy RE condition. In 
particular our results apply to both deterministic and random 
sensing matrices and we present several results for both cases. 
We also conduct several synthetic and real-world experiments 
and demonstrate the tightness of the oracle bounds on error 
as well as the efficacy of our method. Specifically, it has been 
suggested in the literature that LASSO can handle exponential 
family noise such as that arises in our application sa. It turns 
out that t\ constrained ML estimator uniformly outperforms 
LASSO and has significantly superior performance in many 
interesting regimes. 

The paper is organized as follows: In Section [II] we in¬ 
troduce the notation and state our sparse estimation problem. 
Section[III]describes our theoretical results on the convergence 
of the regularized ML decoder. The numerical results for 


different interesting scenarios are demonstrated in Section IV 


Finally, the detailed proof of the main theorems and lemmas 
are provided in Section [Vi] 


A. Related Work 

Parameter estimation for non-identical Poisson distributions 
has been studied in the context of Generalized Linear Models 
(GLMs). However, our model is inherently different from the 
exponential family of GLM models that has been studied in 
0,0, GQl, on. In particular the GLM model corresponding 
to the Poisson distributed data studied in the literature has the 
following form: 

Model I : y* ~ Poisson(exp (a^w*)) 

Therefore, the log likelihood takes the following form: 

n 

A O) = J 2 yi ( a ^ w ) _ exp ( a ^ w ) 

i=1 


In contrast, in the setting we are interested in, the observations 
are modeled as follows: 

Model II : yi ~ Poisson(Ao,i + a,/ w*) 

and the log likelihood function has the form: 

n 

£ 2 (w) = ^2 yi lo S (Vi + a 7 w ) - (Ao,i + a7 w) 

i—l 

There are several important differences between the two 
models. We observe that imposing sparsity on w in Model 

I corresponds to smaller number of multiplicative terms in the 
Poisson rates. On the other hand, w being sparse in Model 

II results in fewer number of additive terms in the Poisson 
rates of the corresponding model. At a more fundamental level 
the loss function (negative log-likelihood) for Model I has 
an exponential term (exp (a^w)). The assumptions of strong 
convexity on the feasible cone C could be proved if elements 
of A are independent sub-gaussian draws 0. Consequently, 
unlike our case, the issue of signal amplitude no longer arises 
for this model. 

We can view model I as an instance of a general class of 
sparse recovery problems. Indeed, ifTOll studies the convergence 
behavior of i\ regularized ML estimation for exponential 
family distributions and GLM in this context. The bounds 
on error for sparse recovery of the parameter are based on 
the RE condition. Moreover, in order to get useful bounds on 
estimation error of GLM, they additionally need the natural 
sufficient statistic of the exponential family to be sub-gaussian. 
This condition could clearly be violated in our setting where 
the data is Poisson distributed and there is no constraint on 
the sensing matrix to be sub-gaussian. 

More generally, 0 describes a unified framework for 
analysis of regularized M— estimators in high dimensions. 
They also mention extension of their framework to GLMs and 
describe “strong convexity” of the objective function on C as 
a sufficient condition to obtain consistency of M-estimators 
under Model I. As described earlier, this requirement of 
strong convexity on C is not consistent with our model. In 
addition the statistical aspects in that work requires that the 
components of the sensing matrix be characterized by sub- 
gaussian distributions, which we do not require here. 

Statistical guarantees for sparse recovery in settings similar 
to model II have been provided in fl2l . Ifl3ll . Ifl4l in the 
context of photon limited measurements. They assume that 
the observations are distributed as follows 

yi ~ Poisson (a^w*) 

where elements of the signal w* and sensing matrix are 
positive, and the sensing matrix satisfies the so-called Flux 
Preserving assumption: 

n 

J2 a J w * ^ ii w *iii- 

i—l 

The latter assumption arises in some photon counting ap¬ 
plications, like imaging under Poisson noise, where the total 
number of expected measured photons cannot be larger than 
the intensity of the original signal. The upper bound on 
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reconstruction error of the constrained ML estimator is given 
in the paper lfl3ll . Surprisingly, the upper bound scales linearly 
with the number of measurements. However, this sounds 
reasonable under the Flux Preserving assumption. In fact this 
behavior is due to the fact that for a fixed signal intensity, more 
measurements lead to lower SNR for each observation. As a 
result, unlike conventional compressive sensing bounds, the 
estimates do not converge to the ground truth with increasing 
the sample size. Nevertheless, Flux Preserving constraint does 
not arise in our setting and consequently the application and 
methods of analysis are different. 

In summary the fundamental differences in the underlying 
model as well as the different assumptions in the sensing 
matrices from the previous work warrants new analysis tech¬ 
niques which is the subject of this paper. 


B. Applications 

In the sequel, we will introduce two applications, which 
motivate the model described earlier: 

1. Explosive Identification: In the explosive identification 
example, an unknown mixture of explosives is exposed 
to different fluorophores. The goal is then to estimate the 
mixture components based on the observed fluorophores 
photon counts. Here y t and Ao,i could be considered as the 
photon counts and background emission rates for fluorophore 
i. Ao,i is measured before the exposure of the explosive 
mixture and is known. a%j < 0 represents a quenching effect 
of explosive j on fluorophore i, and w* is the weight of 
explosive j in the mixture (TJ. 

2. eMarketing: In the eMarketing example El, weekly 
traffic of different websites within the same market are mea¬ 
sured. The traffic of site i, denoted by y tl is assumed to 
be affected by the number of bought links to it on differ¬ 
ent advertisement websites. Each advertisement website j is 
also assumed to contribute to the visiting rate by a fixed 
weight w*. These weights could be viewed as measures of 
popularity/dominance of advertisement website j. Moreover, 
Ad , is the average traffic that visits website i directly (not 
through intermediate advertisement website) and is acquired 
through online statistics of the website. Ojj > 0 is the number 
of backward links that the business website i has bought 
from the advertisement website j. Estimation of ru*’s can 
lead to discovery of dominant advertisement websites in some 
industry. Fig. [T| illustrates this eMarketing model. 

II. Definitions and Problem Setup 

Problem Definition: Let yi ~ Poisson(Ai(w*)), with 
A,(w) := Ao,i + a^w and ai,w*eR5_. Furthermore, 
||w*|| x = s and ||w*|| 0 = k. The main objective of this 
paper is to analyze the in error of the maximum likelihood 
estimation of w* given y := (y\,..., y n ) and s. 

Likelihood Function: Let the objective function be the neg¬ 
ative log likelihood of the data given the unknown parameter 


yi 


y 2 


yn-i 

y n 



Google.com Wl 

Ebay.com W2 

Amazon.com \N% 

Pinterest.com W4 

ldeeli.com W5 


Facebook.com Wp-3 
Twitter.com Wp-2 
Tumblr.com Wp-1 
Blogger.com Wp 


Fig. 1. Illustration of eMarketing model. Node on the left are the business 
websites, Nodes on the right are the advertisement websites. A connecting 
edge, dij , is the number of backward links purchased by the business website 
i from the advertisement website j. 


w: 

1 n 

Q(w) := -V'-j/i logAj(w) + Ai(w). 
n z ' 

2=1 

We will let A m i n : = min ? ; Ai(w ), A max : = maxj Ao.zTttmaxS, 
A h '■= (1/7T- ^”_i 1/Ai(w*)) -1 (the harmonic average of 
the rates), a max := max,;j cii.j, and A := [ai,...,a„] T 
throughout the paper. 

The guarantees on the error of the maximum likelihood 
estimator relies on an assumption on the sensing matrix A 
known as Restricted Eigenvalue Condition'. 

Definition 1. Restricted Eigenvalue condition: A matrix A £ 
R" xp satisfies RE(k,jk), if there exists a constant 7 *. > 0 , 
such that for any set of indices S satisfying |Sj < k, and 
vector 


u e C(S) := {u^O: ||u s ||i > ||u S c||i}, 

we have: 

— IIAu||| > 7fc||u||l (2) 

n 

where U 5 is the restriction of the vector u to the indices in 
S, and S c = {1 ,... ,p}\ S. 

III. Main Results 
A. Upper Bound on Error 

We will analyze the performance of l\ constrained Max- 
Likelihood estimator to obtain upper bounds. Specifically, we 
consider the following estimator: 

w := argrnin Q( w) (3) 

Wi<s, Vi : Wi >0 

Theorem 1. Supose A satisfies RE(k, 7 /-), and its elements 
are positive and hounded by a max . Then, with probability at 
least 1 — C ike following error bound holds: 

54A max aLx(3 + logl“) /fclog(2/<) 

l|w-w || 2 < -y- ~\ —=-, 

7 k V A h n 
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where C > 2exp 

Mean Squared Error: Note that since the rates are uniformly 
bounded in terms of the scale parameter s and the error prob¬ 
ability can be made arbitrarily small, we can obtain an upper 
bound for the mean squared error (MSE) in a straightforward 
manner. The MSE upper bounds has an identical scaling with 
respect to sparsity, scale and the number of measurements. 
Upper Bound for Special Cases: Consider the case where 
base rates A 07 are constant and so A max = O(s). For this 
scenario we have with probability 1 — £ that, 

||w - w*|| 2 = o ( S ^ S V 7 k Io g( 2 /C)A^) 

In addition if we also have A m i n > cA max for a constant c, 
so that the harmonic mean A^ = 0(s ) and logA max /A m i n = 
0 ( 1 ), then, it follows that the error behaves as 


Other Distributions: The techniques in the proof could be 
used to extend this result to other heavy tailed distributions 
such as Gamma distribution, when the scale parameter is an 
affine function of w*. Similar to the Poisson case, the curva¬ 
ture of the loss function associated with the log-likelihood of 
Gamma distribution vanishes with scale s. 

Additional Linear Constraints: The proof could be easily 
modified when there are additional linear constraints on the 
Aj(w) values. In that case, the constraints should be included 
in the set C s . Hence, the results can be generalized to the case 
where a,; ; < 0, so long as the Poisson rates are constrained 
to be positive. 

Brief Proof Outline: The main idea of our approach is that we 
can show that the £ 2 error can be bounded by e if no feasible 
perturbation of magnitude e around the true parameter value 
can lead to a decrease in the objective function Q. Further¬ 
more, if Q is strongly convex around w*, an appropriate upper 
bound on e makes the change in the objective function positive 
for all plausible error patterns. Finally, the upper bound on e 
also yields an upper bound on the £9 error. As the loss function 
associated with Poisson distributed data is not strongly convex 
on C, we replace the cone C (Eq. (jTJ) by another set C s , which 
depends on s. This leads to a loss function that is well-behaved 
and subsequently we can derive our results. 

B. Sparsity Scaling 

Theorem [l] describes an upper bound for estimation error 
for matrices belonging to RE(fc, 7 /c). Since 7 /, in turn depends 
on sparsity k, the dependence of estimation error on sparsity 
is implicit. One difficulty in making this dependence explicit 
is that RE condition for matrices that are positive and bounded 
has not been extensively studied. The best known bound is due 
to lfl5l for positive and bounded matrices, who shows matrix 
constructions with n = fl(k 2 \ogp) satisfying RE condition 
with a constant 77 that is independent of k. This in turn implies 
that for positive bounded matrices it is sufficient to have n = 
Ll(k 2 log p) measurements for our estimation error bounds to 
hold. 


Nevertheless, this introduces a seemingly large gap between 
sparse linear regression results where the sample complexity 
grows linearly with sparsity and our corresponding Poisson 
model. It is unclear whether this gap is due to matrix con¬ 
struction or our proof bounding technique. To bridge this gap 
we consider a different matrix construction and a variant of 
our optimization problerrQ 

In particular, we construct our matrix based on another 
elementary matrix A g . The components of A,, are i.i.d. in¬ 
stantiations of a bounded sub-gaussian random variable, whose 
minimum and maximum values are denoted as —a A < 0 and 
a v > 0, respectively. The main reason for this construction is 
that we can now obtain designs with linear scaling, namely, 
for n = 0(klogp) it is possible to construct matrices with a 
constant factor 7 bound (T 6 l in Eq. [ 2 ] We emphasize that it 
is important here for a l3 to take both positive and negative 
values for this to hold. 

We now construct matrices A based on A,,, namely, 

A = A g + a/\l nxp (4) 

where l nxp denotes an all one matrix of size n x p. Finally, 
we consider a variant of our £1 constrained problem: 

w := argmin Q(w) (5) 

. Wi=s, Vi : >0 

The main difference between Eq. [3] and Eq. [5] is that the 
constraint ||w||i < s is replaced by ||w||i = Yhi w i = s - 
While this equality constraint is somewhat more restrictive it 
allows us the flexibility of utilizing matrices with negative 
components. Furthermore, it bridges the gap between the 
sparse linear regression setting and this problem as shown in 
the following corollary. 

Corollary 1. Let n > Cklogp, a max := max(av,aA)> an d 
A be constructed according to Eq. 0 , then the estimator w, 
of Eq. [5] satisfies the following bound: 

11 ~ *11 ^ \ 2 /fc log ( 2 / (f) ( A max ^ 

l|w - w || 2 < c 0 A max a max W--- 3 + log -- , 

y A/2,77- \ ^min / 

with probability of at least 1 — £ — ci exp(—c 2 n), where C, 
Co, Ci, and c 2 are positive universal constants, and £ satisfies 
the conditions of Theorem 1. 

C. Minimax Mean Squared Error Lower Bound 

In this section, we derive a lower bound for the mean- 
squared error and show that the upper bounds obtained in the 
previous section match our lower bound. Our proof technique 
is based on a generalization of Fano’s inequality (33. In 
particular, we show that no estimation algorithm can have a 
worst case mean squared error (MSE) less than 0(yjsk/n ): 

Theorem 2. Let q and a m i n be the maximum eigenvalue of 
A/y/n, and minimum element of A, respectively. Furthermore, 
let Q := {w : ||w||o < k , ||w|| 1 < s, \/i : Wi > 0 } be the 

*We thank the anonymous reviewer for suggesting this scheme. 


w — w 


= 0 


( \Mlog(2/C)/ 
\ 7 k 
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s 


Fig. 2. Linear scaling of average A ^ w.r.t signal intensity s for three different 
random constructions of A. 


set of feasible parameter values. If n > Ca m [ n (k — l) 2 /(srj) 
and k > 9, the minimax error rate is bounded as follows: 


inf sup E(||w — w*|| 2 ) > C' 

w w *g S 

where w is an arbitrary estimation algorithm, and C and C' 
are two constants. 

Matching Lower Bound: The error bound in Theorem 2 
matches (order-wise) the upper bound in Theorem 1 as long 
as A m j n = Q(s). This establishes the fact that li constrained 
ML decoder is minimax optimal in the regime where the 
Poisson rates vary within a constant factor. 

Brief Proof Outline: Let Qm be a subset of Q of size M. We 
consider Qm as a set of hypotheses. It follows from Fano’s 
bound fl7l that the worst case MSE of any estimator is lower 
bounded by 0.5P e S, where P e is the hypothesis test error 
probability and S is the minimum £2 distance of any two 
members of Qm- We are now left to construct an appropriate 
hypothesis subset. We utilize the Gilbert-Varshamov bound ft3j 
in this context to construct a subset such that S is lower 
bounded by f sk/n) for M = f2(exp(fc/8)). Furthermore, 
P e can be lower bounded through the Fano bound by ensuring 
that KL-divergence of any two observation distributions with 
parameters in Qm is no more than 0(k). The detailed proof 
is provided in the Appendix, Sec. |VI-D| 

IV. Numerical Results 
A. Poisson Rates Harmonic Mean 

In this section, we consider three different random matrice 
constructions and verify whether A h scales with s. This linear 
scaling together with the bounds derived in Theorem 1 and 
Corollary 1 imply that the error matches the minimax rate 
0(y/ks/n) up to a log s factor. Specifically, we consider i.i.d. 
copies of three different distributions to construct A. Namely, 
uniform distribution on [0,1], Beta distribution with a = 1 



and A = 3, and a random variable V defined as 
f 0; if Z < —1 

V = 1 Z + 1; if — 1 < Z < 1 (6) 

[ 2 ; ifZ>l 

where Z is a normally distributed random variable with // = 
0, a 2 = 0.25. We take the average of A/, over 50 Monte 
Carlo runs. In each run, A and w* are selected at random. 
Elements of w* on the support are selected from the uniform 
distribution on [0,1], and the support is equiprobable across 
all possible supports. The average value of harmonic mean is 
plotted against s in Fig. [2] This verifies the linear scaling of 
Xh w.r.t s. 


B. Tightness of the Error Bounds 

Theorem 2 characterizes the minimax optimality of i\ 
constrained ML decoder in a certain regime where rates vary 
within a constant factor. On the other hand the upper bound 

in Theorem 1 and Corollary 1 scales as , s _ As noted in 

v A hn 

Fig. Lithe Harmonic mean. A/,, generally scales linearly with s. 
Consequently, the upper bound matches our worst-case lower 
bound. Nevertheless, it is still possible that our upper bounds 
for ML constrained decoder is conservative for the average 
case. In this experiment we attempt to show that the upper 
bounds are indeed tight for the average case as well. 

These reasons motivate us to perform experiments with 
synthetic data to verify the tightness of the derived bound 
in Theorem 1. Elements of A are drawn from the uniform 
distribution over the unit interval. We also consider realiza¬ 
tions where the elements are i.i.d. instantiations of a suitably 
truncated Gaussian random variable defined in Eq. 

We set the dimensionality p = 400, sparsity level k = 5, 
sample size 100 < n < 300, signal strength 1 < s < 2 x 10 4 , 
and the base rate Ao = 4. We fix the direction of w* 
throughout the experiment and change the scale parameter 
s. We also fix A throughout all the experiments. Therefore, 
the uncertainty in drawing Poisson distributed data is the 
only uncertainty across all the experiments. The experiment 
is repeated (for the same values of w* and A) for 10 times. 
Notice that for the truncated normal distribution, we set some 
components of design matrix to zero. 

Recorded points in all the experiments are then plotted. We 
use the interior-point optimization algorithm implemented in 
MATLAB with the termination tolerance value of 10“ 10 on the 
objective function. The termination tolerance on the variables 
is also set to 10 -10 . The gradient of the objective function is 
also given to the optimization algorithm initially. 

In each run of the experiment, the ( 2 norm of the error is 
recorded. The t 2 error is plotted against s/s/nXh in Fig. [ 3 ] 
The hypothetical dashed lines pass ing through the origin shows 
that the error scales as s/yfnXh with high probability. This 
suggests that our upper bounds are tight even for the average 
case. The estimation error is also a function of sparsity level 
k. We next experiment with varying levels of sparsity, k, while 
fixing the scale s. These results are consistent with our bounds. 
This is not surprising because both our lower and upper bounds 
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p = 400, k = 5,100 < n < 300,1 < s < 2 x 10 4 , A 0 = 4 p = 400, k = 5,100 < n < 300,100 < s < 2 x 10 4 , Ao = 4 



Fig. 3. i 2 error vs. scaled derived error upper bound. This demonstrates that actual error lies within two linear scalings of the derived bound with significant 
probability, which in turn shows that derived bound is tight. Entries of A are drawn from the uniform distribution over [0,1] (left) and an alternative distribution 
defined in Eq. (right). 


p = 400, n = 300, s = 1000, A 0 = 0.01 

5 ° -,- 1111 -,— 


40 


C4 30 

I 25 
20 


10 


o U -1-1-1-1-1-1-1-1- 

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 

Vk 

Fig. 4. £2 error changes with \/k when Ai j ~ U( 0,1). This is consistent 
with the upper and lower bounds in Theorems 1 and 2. 

suggest linear scaling of squared error with \Jk. This has been 
illustrated in Fig. [4]for the case that A l :) ~ U (0,1). We do not 
consider experiments in which dimensionality p is changed, 
simply because the derived bounds do not explicitly rely on p 
as long as RE condition is satisfied. This happens for random 
constructions, when n is large compared to k and log p, i.e. 
n = O(klogp). 

C. Gaussian vs. Poisson Model 

In this section, we compare the £2 norm error of he 
constrained maximum likelihood decoder, by using a Gaussian 
likelihood function even though the observations are generated 
using a Poisson model. We then compare the resulting estima¬ 
tion error with the estimation error obtained using a Poisson 
likelihood function. Elements of A are generated according to 
the Beta distribution with a = 1 and /3 = 3, and are sampled 
independently during 500 Monte Carlo runs. During each run 
a weight vector w* £ Ri 00 is chosen at random, and n = 100 


Poisson observations are made based on the model described 
in Sec. [D] We set Ao = 0.01 and change s from 1 to 1000 or 
alternatively fix s = 200 and change Ao from 0 to 10. We plot 
the average of ratio of £2 error in the two cases in Fig. [5] This 
figure shows that Poisson model outperforms the Gaussian one 
especially when both s is large and Ao is small. This could 
be partly explained by the fact that under this condition, the 
variance of observations is small relative to the mean. Hence, if 
the background noise is small, each observation would convey 
more information about w*, making performance of the true 
model significantly better than the conventional Gaussian one. 

D. Rescaled LASSO vs. Regularized ML 

Parameter estimation based on LASSO for the Poisson 
setting has been studied in 0. The idea is to view the problem 
as an additive noise problem, where noise belongs to an 
exponential family of distributions. Alternatively, in {§] the 
problem is viewed as an additive Gaussian noise problem with 
noise variance being equal to its mean to mimic “Poisson like” 
behavior. This results in a rescaled version of LASSO, which 
is then used to estimate model parameters. This amounts to 
scaling the loss function associated with each observation by 
the corresponding mean value (or equivalently the variance). 

This approach motivates us to compare the regularized ML 
method against rescaled LASSO for poisson distributed data. 
This will highlight the essence of using regularized ML instead 
of rescaled LASSO for our setting. In this section, we will 
demonstrate that our regularized ML outperforms rescaled 
LASSO in several regimes including low SNR, high dimen¬ 
sions, and moderate to low sparsity levels. We use probability 
of successful support recovery as a measure of performance. 
We use this measure because of the scale invariance property 
of this measure. It avoids the issue of signal strength s in 
impacting the performance metric and obviates the need to 
normalize the measure with respect to s, as we did in the last 
section. 

To compare the performance of regularized ML and rescaled 
LASSO, we first generate a random sensing matrix A £ R nxp 
where each element is an independent truncated Gaussian 
random variable. We then generate a random w*. To recover 
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Fig. 5. I 2 errors of Maximum Likelihood estimation assuming a Gaussian observation model vs. assuming the true Poisson distribution. The ratio of errors in 
two cases across different signal amplitudes (left) shows that the true Poisson model performs constantly better than the Gaussian one when signal amplitude 
is large. The same plot across various Ao values (right) shows that the difference between two models is significant when the background noise is small. 
These two plots together suggest that under high SNR values there is a significant difference in the performances of the two models. 


w*, we draw n Poisson distributed samples with coefficients 
specified in A as: 

Hi ~ Poisson (Aq + a^w*) 


We first solve the non linear optimization where w is con¬ 
strained to satisfy V* : Wi > 0, Wi < s. 

1 " 

W ML = arg min - V' Vi log(A 0 + a7 w) - a^w 

Vi : Wi>0 J '£2 i ^ 2 — I 

For the purpose of comparison, we also compute the rescaled 
LASSO estimator: 


W LS = 


Vi 


arg mm 


1 y' (Vi - *0 ~ a J w ) 2 
n ~[ A 0 + w 


We then threshold the solution by zeroing out components 
of Wml and below a pre-defined small threshold t. We 
average the estimation performance over 100 Monte Carlo 
runs. The performance of the two methods are compared in 
Fig. [6] and Fig. |7| The results are compared for different 
number of observations n, and different sparsity levels k, 
respectively. 

In Fig. [6] we compare the performance of regularized ML 
estimation with that of rescaled LASSO as a function of n. 
We fix Ao = 100, p = 400, t — 10 -4 , s = 1 and k = 40. At 
each iteration, we estimate w* based on n observations where 
n varies from 2 to 400. We compare the performance of the 
two approaches based on probability of successful recovery of 
the support set. This error is 0 if the thresholded support set 
of the estimation is equal to that of the ground truth and 1 
otherwise. We average this measure over 100 samples of w* 
for a fixed A. 

In Fig. [7] we compare the performance of regularized ML 
estimation with that of rescaled LASSO for different sparsity 
levels, k. This time, we fix Ao = 100, p = 200, s = 1, 
and n = 100. For each k, we generate 100 samples of k- 
sparse w*’s and recover them from n observations. Since 


Regularized ML VS Lasso for p=400, X. Q =100, k=40 



Fig. 6. Probability of successful support recovery as a function of n for 
p = 400, Ao = 100, k = 40, t = 10 -4 , and m = 100 Monte Carlo 
loops. This figure illustrates twice faster convergence of probability of success 
with respect to number of observations for Regularized ML in comparison to 
Rescaled LASSO. 


||w*||i = 1 for all values of k, we threshold each element 
of w by t = ^1, to obtain their sparse support set. We 
measure the performance of the two estimators based on 
average probability of successful recovery of the thresholded 
support set for each value of k. Notice that the error bars 
in Fig. [6] and Fig. [7] indicate that the difference between the 
methods is indeed statistically significant. 

In Fig. [8] we compare the result of regularized ML esti¬ 
mation with that of rescaled LASSO in terms of the ROC 
curves. In our ROC curve, we plot the average number of 
true detections against the average number of false alarms. 
True detections are indices that are common in the thresholded 
estimated support set and that of the Ground Truth, whereas, 
false alarms are the indices in the thresholded estimated 
support set that are not included in the support set of the 
Ground Truth. This time, we fix Ao = 100, p = 200, n = 100, 
s = 1, and k = 20. We fix a sensing matrix A, and generate 
100 random w*’s. By applying different thresholds t = \ to 
t = °' ( / |° l we obtain the different points in the ROC plot. We 
average Probability of Detection (PD) and Probability of False 
alarm (PF) over 100 Monte Carlo runs. 




























Regularized ML VS Lasso for p=200, n=100, A 0 =100 



Fig. 7. Probability of successful support recovery as a function of k for 
p = 200, Ao = 100, n = 100, and m = 100 Monte Carlo loops. As 
this figure suggests, probability of successfull recovery drops faster when the 
sparsity level increases for Rescaled LASSO in comparison to Regularized 
ML. This shows robustness of ML approach to model parameters. 


ROC curve for ML and Lasso 



Fig. 8. ROC curve for Regularized ML and Rescaled LASSO for n = 100. 
p = 200. k = 20, A = 100, and in = 100 Monte Carlo loops. This 
figure illustrates the superiority of Regularized ML in comparison to Rescaled 
LASSO for parameter estimation under Poisson models. 

E. Explosive Identification 

In this experiment, we first measure the light intensities 
of different fluorophores before and after separate exposures 
to a unit weight of different explosives. The intensities are 
measured by counting the number of photons received at each 
photo-sensor. Each explosive j has a unique quenching effect 
in the fluorescence property of each fluorophore i, which 
we denote by a, ( r In the experimental setting. A, is the 
before exposure intensity for fluorophore i and is estimated 
by averaging the before exposure photon counts from multiple 
experiments. Therefore, A,’s can be assumed to be known. We 
model the after exposure intensity of j-th explosive in sensor 
i as : yi ~ Poisson(Aj(l — ajj)) 

In the next step, fluorophores are exposed to an unknown 
mixture of these explosives. The goal is to recover which and 
how much of each explosive is contained in that mixture. 

The physics of the problem suggests that when the fluo¬ 
rophore is exposed to a mixture of explosives, the quenching 
effects are additive in the regime where the mixture weights 
are small IS. Therefore, our observations are best modeled 
by a Poisson distribution with additive rate model for each 
fluorophore: 

Vi ~ Poisson (Aj(l - a7w*)) 

where w* is the amount of the explosive j in the mixture. 
We solve this problem through Regularized ML and Rescaled 
LASSO and compare the results. 


In this problem, matrix A, the responses of n = 8 fluo¬ 
rophores to p = 12 basic explosives is given. Based on this 
given data and our additive model for mixtures, we generated 
10 mixtures by combining up to 3 random explosives. We 
used Regularized ML and Rescaled LASSO to identify these 
mixtures through their effect on fluorescence property of our 
fluorophores. The result is shown in the form of a 10 x 12 
grid in Lig. [9] In this grid, rows are different mixtures and 
columns are different explosives. Dark squares indicate the 
absence (or negligible contribution), whereas lighter squares 
indicate higher amount of the corresponding explosive in the 
associated mixture. 



Fig. 9. Sparse recovery results for k < 3. From left to right: Ground Truth, 
w*. ML estimate of w*. Rescaled LASSO estimate of w*. Columns: Basic 
explosives. Rows: Synthesized mixtures with k basic explosives. Non-black 
squares at each row show the explosives that the corresponding mixture is 
composed of. Lighter colors show larger amounts. 


F. Internet Marketing Application 

This application presents a different scenario where the 
ultimate objective is prediction and inference. These exper¬ 
iments are useful because they further our understanding 
of Poisson models in terms of the why/how and in what 
contexts Poisson models are more meaningful in comparison 
to Gaussian models for Poisson distributed data. While the 
goals here are outside the purview of the theory we have 
developed, it nevertheless serves the purpose of understanding 
sparse high-dimension Poisson models in the context of not 
only estimation but also for prediction. 

We discuss the eMarketing application in this section, where 
we will illustrate Poisson model is more meaningful than the 
conventional Gaussian noise model. In this application, our 
goal is to identify the most effective advertisement websites 
that result in higher website traffic in the clothing market. 
Our assumption is that the website traffic is generated as a 
superposition of the traffic generated from current customers 
and the traffic from advertisement through backward links 
(links in advertisement websites that are linked to these 
business websites). In general, big business websites typically 
buy a total of 1000-1500 backward links from a number 
of advertisement websites. However, the hypothesis is that 
only a few of these advertisement websites are efficiently 
directing costumers. Our goal is to identify those dominant 
advertisement websites. 

We model the number of daily visits, yi, by: 

yi ~ Poisson(Ai j o + w*) 

where A 0l models the current customers who visit the site 
directly and is obtained through online statistics of the website. 
Specifically, a long run average of traffic which is not referred 
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by any advertisement website gives a reasonable estimate and 
so we can assume that Ao,i’s are actually known. This traffic 
could be logged and acquired through online statistics of the 
business websites. 

Moreover, u, :l is the number of backwards link for the 
website i in the advertisement website j. Our model assumes 
that each of the backward links brings independent traffic 
to the website. Therefore, we used the Poisson distributed 
random variable described earlier to model the number of 
visits to a business website. 

Our observations are the daily online visits to 50 top 
clothing brands. From the information provided in alexia.com, 
we chose the top 150 advertisement websites for these brands 
along with the number of backward links for each website. 
Our goal is to recover the weight vector w*, where w* is a 
measure of dominance for advertisement website j in clothing 
market. We recover w* via regularized ML and rescaled 
LASSO (weights smaller than 0.01 are theresholded to 0). 
The result is provided in Table 1. In this table, we illustrate 
the corresponding score for each popular website based on 
their dominance in advertising for clothing brands. 

TABLE I 

Top backwardlist websites for clothing brands using 
Regularized ML and Rescaled LASSO 


Backward link 

ML estimated weight 

Amazon 

0.32 

Twitter 

0.21 

Pinterest 

0.17 

Google 

0.15 

Blogger 

0.06 

Bing 

0.05 

douban 

0.01 

tumblr 

0.01 

Backward link 

Lasso estimated weight 

Amazon 

0.35 

Pinterest 

0.17 

Twitter 

0.16 

Google 

0.16 

Bing 

0.13 


To compare the result of the two approaches mentioned 
above, we use the Bayes factor m and predictive held- 
out log likelihood comparison mentioned in fl9l . It should 
be mentioned that these tests are interpretable only when 
the number of parameters are comparable in the hypothesis 
models. In our problem, in fact, the two models have equal 
number of parameters. 

Given a set of observed data yi,...,y n , and a model 
selection problem in which we have to choose between two 
models, Bayesian inference compares the plausibility of the 
two different models Mi and M 2 through a likelihood test: 

Pr (y 1 ,...,y n \M 1 ) < ^ 
Pr( yi ,...,y n \M 2 ) > 

When the parameters of models Mi and M 2 are not known a 
priori, in Bayes factor test, we estimate them from yi, ■ ■ ■ ,y n 
and then use those estimations in computing the likelihood 
ratio. On the other hand, in predictive held-out log likelihood 
comparison, we divide the data into two groups. We estimate 
the model parameters for Mi, and M 2 using the first group 



Fig. 10. The Bayes Factor Ratio for regularized ML and rescaled LASSO. 
As we can see, higher Bayes Factor suggests that Poisson setting introduced 
earlier is a better model for this problem compared to the traditional mean 
squared loss. 


of data, and we compare the likelihoods for the second part 
of data given M- t and M 2 specified by the first group. 

Since Poisson is a PMF distribution on integers, to compare 
the two models using Bayes factor, we need to superimpose the 
Gaussian distribution on a histogram defined on integer valued 
2 /i’s. For a Gaussian distribution characterized by A/"(/x, cr), the 
value of the histogram at each integer valued y is computed 
as: 


(7) 

It is easy to show that this histogram corresponds to a valid 
PMF. We denote this PMF by W(/r, cr). 

After this conversion, the Bayes factor as a function of 
sparsity level, k, is calculated as: 

Pr (yi, ■ ■ ■, y n \yi ~ Poisson(A 0 ,i + a) - 
BF k ~ 

Pr (yi, • ■ • ,Vn\yi ~ A/" (Ao,i + a J w£ g , A 0 ,i + aT w£ s )J 


where w k ML 


and w ls 


are k sparse thresholded approxi¬ 
mations of w ml and w/, 5 , respectively. The Bayes factor 
log curve as a function of sparsity is presented in Fig. 


10 


To compute the predictive held-out log likelihood for each 
method, we first use 80% of the data (40 training data) to 
calculate an estimation of the parameters, w ml and w is- 
We use w ml and w ls for each model to compute the log 
likelihood function for the remaining 20 % of data (10 test 
data): 


10 

k~-ML = F, -^O.i-aJ^ML+yi log(A 0 ,i+a) r WML)-log(?/,;!) 

i= 1 


C-LASSO = ^ — log (q(\J Ao,i + w ls)j + 


log I Q 


Vi - Ao,i - a7 w L s ’ 


\/v 


-Q 


yi + 1 - A;, 0 - ajw LS 


i + a,' w ls 


\J Ao,i 


+ a,' w ls 


where Q stands for the tail probability of standard normal 
distribution here. Intuitively, the model that is closer to the 
ground truth results in higher log likelihood value. The log 
likelihood values for the two approaches are shown in Fig. 
ED The large gap between the predictive log likelihood of the 
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Fig. 11. The Predictive Held-out Log Likelihood Comparison. The huge gap 
between the Predictive Held out Likelihood of Regularized ML and Rescaled 
LASSO implies that Poisson is the right model for this problem therefore, 
Regularized ML approach should be taken. 



Fig. 12. Dynamics of SEM and SMM over time for clothing market. This 
figure gives a quantitative comparison of the two most controversial methods 
of online marketing. 

two models implies that Poisson is a better underlying model 
for this application. 

G. Dynamics of Online Marketing 

In the previous section, our results show that ML estimator 
and Poisson model outperforms LASSO approach for the prob¬ 
lem of online marketing. Therefore, in this section, we apply 
ML method to estimate the weights, w*, for the advertisement 
websites over time. 

TABLE II 

Top advertisement websites for clothing market in 2013 and 

2008 


demonstrates the dynamics of SMM and SEM, the most 
controversial forms of online marketing, over time CD. Al¬ 
though SEM has been thought to be the most powerful media 
marketing tool, recent empirical studies show the growing 
influence of SMM during the last couple of years l20l . The 
gigantic size of social media coupled with the relatively low 
cost per impression and the so called word of mouth have 
made SMM a powerful marketing tool. Our results confirm 
the significant influence of SMM relative to SEM since 2012. 

V. Conclusions 

In this paper we proposed a high dimensional sparse es¬ 
timation problem where the observations are governed by 
heterogeneous Poisson rates motivated by applications such 
as Online Marketing and Explosives detection. Unlike least- 
squares linear regression setting, we showed that the scale of 
the parameter has a significant effect on sample complexity. 
In particular we derived upper bounds by analyzing the per¬ 
formance of an i\ constrained decoder and develop matching 
lower bounds for worst-case mean-squared error. We presented 
a number of synthetic and real-world experimental results 
to not only demonstrate the importance of scale in Poisson 
problems but also the superior performance of l\ constrained 
ML over rescaled LASSO for a number of problems and 
metrics that are of practical interest. 

VI. Appendix A 

A. Definitions 

Definition 2. Strong Convexity : A differentiable function 
Q(.) : R p -A R. is called Strong Convex around a point w* 
over a set of perturbations C when the following holds for 
constants n , and t : 

VA G C : Q( w* + A) - Q(w*) 

-(VQ(w*),A)>«||A|||-t||A|| 1 . 

B. Proof of Theorem 1 

The proof technique is similar to the one in 1211 . with the 
exception that the allowed perturbation set C s now depends 

on s. Let IF := Q(w* 4- A) — Q{ w*), and 


Backward link 

Estimated weights in 2013 

Estimated weights in 2008 

Twitter.com 

0.18 

0.02 

Facebook.com 

0.18 

0.02 

Pinterest.com 

0.14 

0.00 

Amazon.com 

0.28 

0.22 

Google.com 

0.15 

0.52 

Bing.com 

0.05 

0.00 

Yahoo.com 

0.00 

0.11 


A brief look at Table II shows how w*fi s have changed 
dramatically over time. To study this change closely, we 
estimated w*’s for different advertisement websites from 2004 
to 2013. We group the Social networks, such as facebook.com, 
twitter.com, pinterest.com, etc, together to study the effect of 
Social Media Marketing (SMM). We also group search en¬ 
gines, such as google.com, yahoo.com, bing.com, etc, together 
to represent Search Engine Marketing (SEM). We add the 


scores of the corresponding websites in each group. Fig. 12 


C s (w*) := {w — w* :\/i Wi> 0, ||w||i < s}. (8) 

For the sake of brevity, we will use C s to refer to this set 
in the rest of the paper. We first show that if T > 0 for all 
A £ := C 5 PI {A : || A|| 2 = (5}, then |jw — w*|| 2 < 6. 

First note that for any Ao = (wo-w*) G C s , and 0 < a < 1, 
aAo also belongs to C s : 

aA 0 = a(w 0 — w*) = aw 0 + (1 — a) w* — w*. 

But we have ||aw 0 + (l — a)w*||i < a||wo||i + (l —a)||w*||i < 
s. Moreover all elements of awo + (1 — a)w* are positive. 
Hence aA 0 belongs to C s . 

Next suppose A 0 := w — w*. Note that A 0 G C s . We will 
assume || Ao ||2 > 5 and show that for at least one element of 
C s (T Ka, F r < 0. Let Ai = pj Aq. Based on the earlier 
argument, it is clear that Ai G C s . Furthermore, it belongs 
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to K^. Next, as .F(A) is convex, Jensen’s inequality could be 
used to show that .F(Ai) < 0: 


T 




< 1 - 


6 

IA 0 1| 2 

+ TT 


^(0) 


v 0 II 2 


- J-(Ao) < 0. 


The last part follows by noting that ,F(0) = 0 and .F(Ao) < 0 
as w is the minimizer of Q. 

Hence to bound the £ 2 error by S, we need to show that T 
is positive all across C s flKj. But, using Lemma [2] T could 
be lower bounded as follows: 


•F> k ||A||2-t||A|| 1 + (VQ(w*),A> 


where n and r are defined in Lemma [2] We also have 


l(VQ(w*), A)| < ||VQ(w*)|| 00 ||A|| 1 < vn\\*\\u 

where the last step has been derived using Lemma [4] and v n 
is defined therein. Putting all these together: 

T>K\\A\\ 2 2 -2(T + is n )Vk\\A\\ 2 , (9) 

where in the last step we have used the fact that: 

IIA|| 1 = ||As||i + ||A^o|| 1 < 2||As||i 

< 2\Z*||A 5 || 2 < 2V*||A|| 2 , 

where S = Supp(w*). (i) follows because for any member 
of C s , || Ag||i > || A^c || 1 (check L emma [T) , and (ii) holds 
because for any vector x, ||x||i < A/dim(x:)||x|| 2 . Now using 
Eq. it is easy to see if ||A|| > 5 := 3(r + v n )\fk/n, 
T > 0. This completes the proof. □ 

Lemma 1. For any A £ C, defined in Eq. d8ll, II Ac Hi > 
Ag||i, where S = Supp(w*). 

Proof. From the definition of C s in Eq. ([8]), we know: 

II As||i = ||w s - w*||i > ||w*||i - ||ws||i 
Moreover, from ||w||i < s for all A £ C s , we have: 

||ws||i + ||w S c||i = ||w||i < s = ||w*||i 

Therefore, 

||A S c||i = ||w S c||i < ||w*||i - ||ws||i < I!As||i 

□ 


Lemma 2. Let Q be the negative log likelihood of the Poisson 
model introduced in Sec. |7/] and inax,.j a,;.j < a m ax- If A 
satisfies RE(k,^k), Q will be Strong Convex around w* over 
the perturbation set C s (w*) (defined in Eq. ®f°r 

K ' 9A ’ 


T :— ®max(4 T 2 log(A max / A m in)) 


log(2/C) 

nXh 


with 
2 exp 


robability of at least 1 

y?.Amin m in(l 4 min) 

4A h 


Q where £ > 


Proof. Let 

SQ := Q(w* + A) - Q(w*) - (VQ(w*), A), 

and A := w — w* £ C s . Using some algebraic manipulations, 
SQ could be written as follows: 

1 ” 

SQ = - V -Vi log (1 + a J A/Aj(w*)) + Via J A/A,(w*). 
n z ' 

2—1 

Leting di := - log (l + A/Ai(w*)) + a^ A/Aj(w*), it 

can be simplified to: 

1 n -\ n 

SQ = - V] — Aj(w*)dj + - V7t/j - Aj(w *))di. 


i=1 


i=l 


SQ 1 


6Q2 


Assuming that A satisfies RE(fc, 7 fc), 5Q 1 can be lower 
bounded for all A £ C,(w*) using Lemma [3] 


SQi > 


7 fc ||A||I 

9A 


On the other hand, absolute value of SQ 2 could be shown to 
be upper bounded as: 

( 


\SQ 2 \ < 


\ 


max — Aj(w*) log ( 1 + 


t.Aec 


a} A 

Aj(w*) 


- a A 


V 


SQ 2,1 

\ 


1 n 

- ly*/ A i( w *) - !| 

71 ^-/ 


V 


SQ2 




Suppose aj A < 0. Then, SQ 2.1 can be upper bounded as 
follows: 

SQ 2 ,i< (A»(w) - aj A) log (l + a * A 

\ Xi(wQ 

The bound could be simplified to obtain: 


a A 


SQ 2 ,i < 


Ai(w) log 1 + 


a,' A 

Ai(w*) 


ai A log 1 


a A 


a A 


M w *), 

Then, the first order Taylor expansion of the first term 
around zero yields: 

a7 A 


SQ 2,1 < Aj(w) 


1 


A 8 (w*) 1 + Cj 

“1“ ®max (IT log(A max /A min )) | A || 1 , 


where c,- is a number between , ) and 0. But note that 

1 Ai(w*) 


a A 


Aj(w) 


therefore, 


Aj(w*) Aj(w*) 

A»(w) 


- 1 < 0, 


1 T Ci T 


A,:(w*)' 
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Hence, we have 

SQ 2,1 < (2 ®max ®max log(A max /A m i n ))|| A||i. 

Now suppose a7A>0. We can then upper bound dQj.i as 
follows: 


SQ2,1 < 


A*(w*) log (1 + 


a7 A 


a7 AI . 


SQ 2 .I < 


a^A 1 

A»(w* x —— 

Ai(w*j 1 + Cj 


+ Ia7 A 


SQ 2,2 7 2a n 


i=l 


Hence 


1 n / T a N 

SQi > — -A max log (1 + — ) + a7 A 


i=l 


An 


Amax , ( , a 7 a 7 A 

£-10g 1+7=— 1 + 


i= 1 


An 


An 


X,; = 


a , A 


1 II Til 2 1 v2 1 ( a 7^) 2 \ 7k||A| 


n *~i A max 


where the last inequality follows from the fact that ||A||i < 
2s for all A £ C s .Therefore, SQi can be lower bounded as 
follows: 


\ M W *L 

Using the Taylor series expansion of the first term, we obtain: 


1 ^ ^ 

SQi > min -A m axV' 

l|A|| 2 >7fc xtJ * =1 ■ 

Using some algebra, we obtain: 


Xf. 


/—/ n _|_ 2sa max \2 1 

1. 1 ' A m nv 


An > ^ll A H 2 

SQl ~ ~9A-' 


where c,: is a number between 0 and A ; w , ;i . Hence we get: 
SQ 2,1 < 2a m ax||A||i. Considering both cases, we arrive at: 

SQ2,i < (2 ®max “1“ ®max log ( A max / Amin ) ) 11 A111. 

On the other hand, using Corollary [2] SQ 2.2 is upper bounded 
with probability of at least 1 — £ : 


□ 


Lemma 4. The following bound holds with probability of at 
least 1 — C 

||VQ(w*)|| 00 < v n , 


where 


/ log(2/0 _ 

n\ h 

Using the last two inequalities completes the proof. □ 

Lemma 3. If 

1 n 

SQi ~ - Y ~M W *) log (l + a 7 A/Aj(w*)) + a 7 A, 


v n :— 2a n 


I l°g(2/C) 

n\ h 


where e > 2exp 

Proof The derivative of Q could be stated as 


ViSLi 


VQ(w*) = Y ~ w M + a >- 


and A satisfies REtkj'yk), then SQi > Tijj A ^ 2 for all A £ 

lx A m ax 

C B (w*). 

Proof We use the following inequality to lower bound SQi : 


Hence 


a < b => alog(l + x/a) < 61og(l + x/b) (10) 


||VQ(w*)|| 00 < Y I Vi/X l (w*) - 1|. 

n z —' 

i =1 

Therefore, using Corollary [2] with probability of at least I 

||VQ(w*)|| 00 < 2a m 


/log(2/C) 


nX h 


( 11 ) 


( 12 ) 


where (,>2 exp ^ 


-rcA min min(l,A„ 
4A h 


□ 


and from inequality — log(l +x) > —x, we can show that for 
all A, SQi > 0. We make the following change of variables: 


Since by Lemma[T| all A £ C s satisfy the precondition of the 
lower bound in RE condition, we have: 


Lemma 5. Bernstein Inequality : Let y,, 1 < i < n, be 

independent random samples with means Ai,..., A„. Suppose 
3 L > 0 such that Vm £ N 

E[|tfi-A i\ m ] < ^E[( yi -A f) 2 }L m ~ 2 m\ 

Then the average absolute deviations from the means can be 
bounded with high probability: 


Pr I -Y i yi - A *i - 

\ n 


2 1 


Ymvi-Ky 

\ %=1 


> 1 — 2 exp (—t 2 ), 


Now, by applying Taylor series expansion around X, = 0 to 
each term in SQ 1 , we will have: 

- log (1 + Xf) + Xi = —Xi + —-Xf + Xi 

(1 + Xi) 2 

where X, lies between 0 and |X,;|: 

|X,| = |cxO + (l-c)xJf,|<|X,|< 7^ 


o<,< va.;{»± g. 


fory^u^ -^ 

Lemma 6. For ip ~ Poisson(Ai), there exists a number L > 0 
such that for all integers m > 1, 

EOifc-Aip] < ^ E[(j/i - Ai) 2 ]L m_ 2 m! 

Moreover, L = max(l, y/Xf). 
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Remark: The proof of this Lemma is partly provided in 
Ho) and is based on the fact that moment generating function 
for Poisson distribution with rate A, 


M{t) = exp (A(exp(i) — 1) — At) 


is an analytic function, which means that its Taylor series 
converges around t = 0 on an open set in EL Therefore, the k- 
th coefficient of Taylor series exists and is bounded. If A < 1, 
it could be shown that all Taylor coefficients of M(t) are less 
than or equal to If A > 1, we replace t by t/y/X. Then, 
all Taylor coefficients could be shown to be less than This 
completes the proof. □ 

Corollary 2. Let y i: ~ Poisson(Ai(w*)), 1 < i < n, and 
Xh be the harmonic average of Xi(~w*)’s. Then the following 
bound holds 


Pr 


1 

- 1 J/iAi(w*) 


1 | < 2 


i°g(2/cA 

n\ h ) 


>i-C 


for £ > 2 exp 


^^min min(l,A m j n ) 
4A h 


Proof Follows trivially from Lemma [ 6 j and the Bernstein 
inequality. □ 


C. Proof of Corollary 1 

Note that for any feasible w in the optimization, Ai(w) 
could be written as: 

Aj(w) = A 0)i + ajw = A 0 ,i + a| s)T w + a A s 

= (A 0 ,i + a A s) + a- 9)T w, 

where af' 1 is the i- th row of A g . Now assuming the new 
design matrix is A 9 and the base rates are Ao^ + sa A , we can 
repeat the proof of the theorem 1. This time, as A,, satisfies 
RE(0(l),fc) for n > Cklogp, we get 7 = 0(1). and a max 
has to be replaced by max(av,a A ), which is the maximum 
possible absolute value of A g entries. □ 


D. Proof of Theorem 2 

Without loss of generality we assume that the minimum 
element of A occurs in the fc-th column. First, using Gilbert- 
Varshamov bound, we obtain M > exp{(fc — l)/8} vectors 
r, e on the hypercube {0, l} fe such that the Hamming 
distance is bounded from below by (k — l)/4 0, i.e., 


For future reference, we note down £2 and £1 norms of 
pairwise difference of the w, ’s: 


Sij II 2 := 


|W* - Wj|| 2 = 





1 /(fe-l)0 mi nS 

c\ - 


Infill = ll W * - W jl|l > 


nr] 

k- 1 
4c 


nq 


We next compute the KL divergence for Poisson distributed 
random variables namely, y t ~ Pois(Ao.t + a7w.,) for different 
pairs i, j. It follows that, 

1 ” 

D KL (Pois(X 0tt + aj Wj)| |Pois(A 0 ,t + a7 w i)) 


t=l i,j 

1 

“ AP 


EEo ^') 1 ^ 1 


t=1 i,j 


a 7 S, 


t u ij 


^0 ,t + aj w j 


where we have assumed w.l.o.g that aj Sij > 0. Next note 
that. 


log 1 + 


aj 


< 


aj 


(A 0)t + aJ-Wj)J (Xoj + aJwj) 


Consequently, we obtain 

(a7*tf) 2 


< n ^ll A %lll < nr]\\8 t . 


Jill 


< 


2(A 0 , t d-a^Wj) 2a 

(*-i) 


Finally, from the generalized Fano bound Oil we know that 

/ + log 2 


inf sup E(||w — w*|| 2 ) > ^ min (l- 

W w *gg Z \ 

If we make c large enough (c > 34), we will have 
I + log 2 8 8 log 2 


log(M) 


log(M) 


< + 


k- 1 


< 0.7, 


for k > 9. As a result, we obtain 


• t ttji/ii *11 n — 0.3 /(k l)a m i n s 

inf sup E( w - w 2 ) > -r~\ - 

w w * 6 p 4c y my 


d H (Ti,Tj) > (k - l)/4 
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